ABSTRACT

The  NYU  Ultracomputer  is  an  architecture  for  a  large  scale  MIMD  (Multiple 
Instruction  stream,  Multiple  Data  stream)  shared  memory  parallel  computer 
that  may  be  viewed  as  a  column  of  processors  and  a  column  of  memory 
modules  connected  by  a  rectangular  network  of  enhanced  two  by  two 
buffered crossbars. The primary novelty of the design is the ability of the
network  to  combine  multiple  requests  directed  at  the  same  memory  location, 
including  a  new  coordination  request,  fetch-and-add.  This  permits  task 
coordination  to  be  achieved  in  a  highly  parallel  manner.  For  example,  if  an 
arbitrary  number  of  tasks  simultaneously  issue  inserts  or  deletes  to  a  single 
shared  queue  that  is  neither  empty  nor  full,  then  all  these  inserts  and  deletes 
are  accomplished  in  essentially  the  same  time  as  would  be  required  for  a 
single  insert  or  delete. 

This  report  reviews  the  Ultracomputer  architecture  and  system  design  and 
describes  the  VLSI  enhanced  buffered  crossbars  that  are  the  key  to  the 
highly  parallel  behavior  mentioned  above. 

Consider  a  powerful  machine  composed  of  thousands  of  processors  and 
gigabytes  of  memory.  With  10-20  MIPS  (including  floating  point)  and  a  megabyte 
of  memory  soon  to  be  available  on  a  dozen  chips,  such  a  configuration  could  be 
built  to  yield  significantly  more  performance  than  current  supercomputers  with 
roughly  the  same  component  count.    Moreover,  due  to  replication  of  parts,  the 
number of distinct components would be quite small.

Hardware  design  and  assembly  of  a  multiprocessor  with  a  very  high  degree  of 
parallelism  therefore  poses  no  new  problems.  However,  actually  using  all  the 
processing  power  that  can  theoretically  be  generated  presents  a  two-fold  challenge. 
First,  several  thousand  processors  must  be  coordinated  in  such  a  way  that  their 
aggregate  power  is  applied  to  useful  computation.  Serial  procedures  in  which  one 
processor  works  while  the  others  wait  become  bottlenecks  that  drastically  reduce  the 
power obtained. The cost of serial bottlenecks rises linearly with the number of
processors involved; in any highly parallel architecture, they must be eliminated.
Second,  the  machine  must  be  programmable  by  humans.  High  degrees  of 
parallelism  require  simple  languages  and  easy-to-use  facilities  for  designing,  writing, 
and  debugging  parallel  programs  in  order  to  be  effectively  used. 

Our  group  has  proposed  [5]  that  the  hardware  and  software  design  of  a  highly 
parallel  computer  should  meet  the  following  goals. 

•  Scaling.  Effective  performance  should  scale  upward  to  a  very  high  level. 
Given  a  problem  of  sufficient  size,  an  n-fold  increase  in  the  number  of 
processors  should  yield  a  speedup  factor  of  almost  n. 

•  General  purpose.  The  machine  should  be  capable  of  efficient  execution  of  a 
wide  class  of  algorithms,  displaying  relative  neutrality  with  respect  to 
algorithmic  structure  or  data  flow  pattern. 

•  Programmability.  High-level  programmers  should  not  have  to  consider  the 
machine's low-level structural details in order to write efficient programs.
Programming  and  debugging  should  not  be  substantially  more  difficult  than  on 
a  serial  machine. 

•  Multiprogramming.  The  software  should  be  able  to  allocate  processors  and 
other  machine  resources  to  different  phases  of  one  job  and/or  to  different  user 
jobs in an efficient and highly dynamic way.

Achievement  of  these  goals  requires  an  integrated  hardware/software  approach. 
The  burden  on  the  system  designer  is  to  support  a  high-level  and  flexible 
programming  model  to  enable  the  development  of  software  which  schedules  the 
processors  so  that  the  workload  is  balanced,  and,  most  importantly,  avoids  critical 
sections  that  would  constitute  unacceptable  serial  bottlenecks. 

The  next  section  reviews  the  MIMD  shared  memory  computational  model  on 
which  the  Ultracomputer  is  based.  As  indicated  below,  this  model  is  not  realizable 
in  hardware.  In  particular  the  postulated  single  cycle  access  to  shared  memory  must 
be  weakened  to  permit  access  via  a  processor  to  memory  interconnection  network 
having  logarithmic  latency.  A  hardware  design  closely  approximating  this  model  is 
sketched  in  section  three  with  the  interconnection  network  described  in  the 
subsequent  section.  Section  five  discusses  selected  issues  in  the  design  of  the 
custom  VLSI  switches  to  be  used  in  this  network.  In  particular,  the  return  path 
(i.e.  the  memory  to  processor  direction)  design  is  presented.  We  conclude  with  a 
brief  report  of  the  current  status  and  future  plans  of  our  VLSI  effort. 

1. Ultracomputer Architecture

In  this  section  we  review  the  architectural  model  on  which  the  Ultracomputer  is 
based,  and  discuss  its  power.  Although  this  idealized  machine  is  not  physically 
realizable,  a  close  approximation  can  be  built.  Elements  of  the  actual  machine 
design  are  described  in  order  to  illustrate  integrated  hardware/software  mechanisms 
for  bottleneck-free  coordination  of  a  very  large  number  of  processors. 

1.1.   The  Model 

An  idealized  parallel  processor,  dubbed  a  "paracomputer"  by  Schwartz  [18]  and 
classified  as  a  WRAM  by  Borodin  and  Hopcroft  [1],  consists  of  a  number  of 
autonomous  processing  elements  (PEs)  sharing  a  central  memory  (see  also  [6,19]). 
Every  PE  is  permitted  to  read  or  write  a  shared  memory  cell  in  one  cycle.  In 
particular,  simultaneous  reads  and  writes  directed  at  the  same  memory  cell  are 
effected  in  a  single  cycle. 

In  order  to  make  precise  the  effect  of  simultaneous  access  to  shared  memory 
we  define  the  serialization  principle,  which  states  that  the  effect  of  simultaneous 
actions  by  the  PEs  is  as  if  the  actions  had  occurred  in  some  (unspecified)  serial 
order.  Thus,  for  example,  a  load  simultaneous  with  two  stores  directed  at  the  same 
memory  cell  will  return  either  the  original  value  or  one  of  the  two  stored  values, 
possibly  different  from  the  value  which  the  cell  finally  comes  to  contain.  Note  that, 
in  this  model,  simultaneous  memory  updates  are  in  fact  accomplished  in  one  cycle; 
the  serialization  principle  speaks  only  of  the  effect  of  simultaneous  actions  and  not 
of  their  implementation. 
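
The example above can be checked by brute force. The following sketch, which is ours and not part of the original design, enumerates every serial order of one load and two stores and records the value seen by the load together with the final cell contents; the function name and the sample values are illustrative assumptions.

```python
from itertools import permutations

def possible_outcomes(initial, store_a, store_b):
    """Return the set of (value loaded, final cell value) pairs over all
    serial orders of one load and two stores to the same cell."""
    actions = [("load", None), ("store", store_a), ("store", store_b)]
    outcomes = set()
    for order in permutations(actions):
        cell, loaded = initial, None
        for kind, value in order:
            if kind == "load":
                loaded = cell      # the load sees whatever is in the cell now
            else:
                cell = value       # a store overwrites the cell
        outcomes.add((loaded, cell))
    return outcomes

outcomes = possible_outcomes(initial=0, store_a=1, store_b=2)
```

With these sample values the load may see 0, 1, or 2, and the pair (1, 2) shows a loaded value that differs from the value the cell finally contains.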

In  an  actual  hardware  implementation,  single  cycle  access  to  globally  shared 
memory  is  not  possible  to  achieve.  For  any  technology  there  is  a  limit,  say  b,  on 
the  number  of  signals  that  one  can  fan  in  at  once.  Thus,  if  N  processors  are  to 
access  even  a  single  bit  of  shared  memory,  the  shortest  access  time  possible  is 
log_b N. As will be seen, hardware achieving this logarithmic access time has been
designed,  but  cannot  use  off  the  shelf  components.  A  custom  VLSI  design  is 
needed  for  switching  components  in  the  processor  to  memory  interconnection 
network.  In  addition  to  increasing  the  design  time,  the  network  adds  to  replication 
cost  and  size.  For  a  fixed  number  of  dollars  (or  cubic  feet,  or  BTUs,  etc.),  such  a 
shared  memory  design,  achieving  the  minimum  memory  access  time,  will  contain 
fewer  processors  or  memory  cells  than  will  a  strictly  private  memory  design 
constructed  from  the  same  technology  and  requiring  no  additional  components  for 
the  communication  network. 
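
To give a feel for the fan-in bound, the sketch below counts the levels of b-input combining needed before N signals can reach a single point, that is, the smallest k with b^k >= N. The function name is our own and the computation is only an illustration of the log_b N argument, not a model of any particular circuit.

```python
def min_fan_in_levels(n, b):
    """Smallest number of levels k such that b**k >= n, using integer
    arithmetic to avoid floating-point rounding in the logarithm."""
    levels, reach = 0, 1
    while reach < n:
        reach *= b      # one more level multiplies the reachable inputs by b
        levels += 1
    return levels
```

For instance, 4096 processors need 12 levels with a fan-in of two and 6 levels with a fan-in of four.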

Although  we  believe  that  the  lower  peak  performance  inherent  in  shared 
memory  designs  is  adequately  compensated  for  by  their  increased  flexibility  and 
generality,  this  issue  has  not  been  settled.  Most  likely  the  answer  will  prove  to  be 
so  application  dependent  that  both  shared  and  private  memory  designs  will  prove 
successful. 

1.2. The Fetch-and-add Operation

We  augment  the  paracomputer  model  with  the  "fetch-and-add"  (F&A) 
operation,  a  powerful  interprocessor  synchronization  operation  that  permits  highly 
concurrent  execution  of  operating  system  primitives  and  application  programs  (see 
Gottlieb  and  Kruskal  [9]).  Fetch-and-add  is  essentially  an  indivisible  add  to 
memory;  its  format  is  F&A(X,e),  where  X  is  an  integer  variable  and  e  is  an  integer 
expression.  The  operation  is  defined  to  return  the  (old)  value  of  X  and  to  replace  X 
by the sum X+e. Moreover, concurrent fetch-and-adds are required to satisfy the
serialization principle enunciated above. Thus fetch-and-add operations
simultaneously directed at X would cause X to be modified by the appropriate total
increment while each operation yields the intermediate value of X corresponding to
its position in this order. The following example illustrates the semantics of
fetch-and-add: Consider several PEs concurrently executing F&A(I,1), where I is a
shared  variable  used  to  index  into  a  shared  array.  Each  PE  obtains  an  index  to  a 
distinct  array  element  (although  one  cannot  predict  which  element  will  be  assigned 
to  which  PE),  and  I  receives  the  appropriate  total  increment. 
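
The index-claiming example can be imitated in software. In the sketch below, threads stand in for PEs and a lock stands in for the indivisibility of F&A; the lock serializes the updates, which is exactly what the Ultracomputer network is designed to avoid, so the code illustrates only the semantics and not the performance. The class and function names are our own.

```python
import threading

class SharedCell:
    """A shared integer with an indivisible fetch-and-add."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()   # software stand-in for atomicity

    def fetch_and_add(self, e):
        with self._lock:
            old = self.value
            self.value = old + e
            return old                  # F&A returns the old value

def claim_indices(num_pes):
    """Each simulated PE executes F&A(I,1) once and records its result."""
    I = SharedCell(0)
    claimed = [None] * num_pes

    def pe(pe_id):
        claimed[pe_id] = I.fetch_and_add(1)

    threads = [threading.Thread(target=pe, args=(p,)) for p in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed, I.value
```

Whatever order the threads run in, the claimed indices are distinct and I ends up incremented by the number of PEs; which PE receives which index is not predictable.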

Fetch-and-add is a special case of Gottlieb and Kruskal's more general
fetch-and-φ operation (where φ may be an arbitrary binary associative operator) [9].
Both of the classic test-and-set and compare-and-swap synchronization operations
are also special cases of fetch-and-φ and the familiar load and store operations are
degenerate cases. For example, Test&Set(S) is just Fetch&OR(S,TRUE).

1.3.  The  Power  of  Fetch-and-add 

Using  the  fetch-and-add  operation  we  can  perform  many  important  algorithms 
in  a  completely  parallel  manner,  i.e.  without  using  any  critical  sections.  For 
example, as indicated above, concurrent executions of F&A(I,1) yield consecutive
values  that  may  be  used  to  index  an  array.  If  this  array  is  interpreted  as  a 
(sequentially  stored)  queue,  the  values  returned  may  be  used  to  perform  concurrent 
inserts;  analogously  F&A(D,1)  may  be  used  for  concurrent  deletes.  The  complete 
queue  algorithms  contain  checks  for  overflow  and  underflow,  collisions  between 
insert  and  delete  pointers,  etc.  (see  Gottlieb  et  al.  [10]).  We  are  unaware  of  any 
other  completely  parallel  solutions  to  this  problem.  To  illustrate  the  nonserial 
behavior  obtained,  we  note  that  given  a  single  queue  that  is  neither  empty  nor  full, 
the  concurrent  execution  of  thousands  of  inserts  and  thousands  of  deletes  can  all  be 
accomplished  in  the  time  required  for  just  one  such  operation. 
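
The core of the idea can be sketched as follows. This is deliberately not the complete algorithm of [10]: the checks for overflow, underflow, and insert/delete collisions are omitted, so the demonstration runs all inserts to completion before starting the deletes. A lock again stands in for the indivisible F&A, and every name here is our own.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()   # stand-in for hardware F&A

    def fetch_and_add(self, e):
        with self._lock:
            old = self.value
            self.value = old + e
            return old

class SketchQueue:
    """Array-based queue whose insert and delete pointers are advanced
    with F&A. No overflow, underflow, or collision checks."""
    def __init__(self, size):
        self.slots = [None] * size
        self.size = size
        self.I = Counter()   # insert pointer
        self.D = Counter()   # delete pointer

    def insert(self, item):
        self.slots[self.I.fetch_and_add(1) % self.size] = item

    def delete(self):
        return self.slots[self.D.fetch_and_add(1) % self.size]

def run_concurrently(fn, args_list):
    threads = [threading.Thread(target=fn, args=a) for a in args_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def demo(n):
    q = SketchQueue(2 * n)
    run_concurrently(q.insert, [(x,) for x in range(n)])
    removed = [None] * n

    def take(k):
        removed[k] = q.delete()

    run_concurrently(take, [(k,) for k in range(n)])
    return sorted(removed)
```

Each insert and each delete touches a distinct slot because its index comes from a distinct F&A result, so no operation waits for another outside the F&A itself.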

2.   Hardware  Realization 

As  indicated  above,  our  computational  model  is  not  physically  realizable,  due 
to  fan-in  (and  other)  limitations.  Furthermore,  memory  modules  operate 
sequentially;  only  one  load  or  store  may  be  satisfied  in  one  cycle.  If  concurrent 
fetch-and-add  or  load  operations  were  to  be  serialized  at  the  memory  of  a  real 
parallel  computer,  we  would  lose  the  advantage  of  parallel  coordination  algorithms, 
having  merely  moved  the  critical  sections  from  the  software  to  the  hardware. 

In fact, a parallel processor closely approximating our idealized paracomputer
can be built. The NYU Ultracomputer uses a message switching network with the
topology of Lawrie's [15] Ω-network (equivalently, the SW Banyan of Goke and
Lipovski [7]) to connect N = 2^D autonomous PEs to a central shared memory
composed  of  N  memory  modules  (MMs).  Note  that  the  direct  single  cycle  access  to 
shared  memory  characteristic  of  paracomputers  is  approximated  by  an  indirect 
access  via  a  multicycle  connection  network. 

Figure  1  gives  a  block  diagram  of  the  machine.  The  remainder  of  this  section 
sketches  the  design  of  the  processors,  memory  modules,  and  caches  (see  [4,8]  for  a 
more  detailed  description).  The  connection  network  is  described  in  subsequent 
sections. 

The  Ultracomputer  design  places  few  constraints  on  the  processors  and  memory 
modules;  for  example  we  take  no  stand  on  the  RISC-CISC  debate.  Naturally,  the 
fetch-and-add  instruction  is  needed.  In  addition,  the  presence  of  a  high  bandwidth 
memory  having  non-negligible  latency  strongly  favors  processors  that  permit 
prefetching of instructions and operands. Although issued by the processor,
fetch-and-add operations are effected in the MMs. When F&A(X,e) reaches the MM
containing  X,  the  value  of  X  and  the  transmitted  e  are  brought  to  the  MM  adder, 
the  sum  is  stored  in  X,  and  the  old  value  of  X  is  returned  through  the  network  to 
the  requesting  PE. 

The  impact  of  network  latency  may  be  reduced  by  implementing  a  cache  with 
each  PE,  thereby  permitting  single-cycle  access  to  frequently-used  instructions  and 
data  and  reducing  network  traffic.  Unfortunately,  caching  of  read-write  shared 
variables  presents  a  coherence  problem  among  the  various  caches.  An  obvious 
method  of  eliminating  this  problem  is  to  simply  not  cache  read-write  shared 
variables  and  have  the  software  distinguish  between  shared  and  private  variables, 
typically  by  grouping  them  into  separate  memory-management  segments.  A  more 
elaborate  scheme  is  based  on  the  observation  that  if,  during  a  particular  code 
segment,  a  shared  variable  is  accessed  read-only,  or  accessed  only  privately,  then 
this variable may be cached during execution of that segment [16].

Fig. 1 - Ultracomputer Architecture

3.   Network  Design 

The manner in which an Ω-network can be used to implement memory loads
and stores is well known and is based on the existence of a (unique) path connecting
each  PE-MM  pair.  This  section  describes  the  overall  design  of  the  network  while 
the  subsequent  section  focuses  on  the  individual  switching  nodes. 

3.1.  Combining  Memory  Requests 

When  concurrent  loads  and  stores  are  directed  at  the  same  memory  cell  and 
meet  at  a  switch,  they  can  be  combined  without  introducing  any  delay  (see 
Klappholz  [13]  and  also  [10]).  Combining  requests  reduces  communication  traffic 
and  thus  decreases  the  lengths  of  the  queues  within  the  switches,  leading  to  lower 
network  latency  (i.e.  reduced  memory  access  time).  Since  combined  requests  can 
themselves be combined, the network satisfies the key property that any number of
concurrent  memory  references  to  the  same  location  can  be  satisfied  in  the  time 
required  for  one  central  memory  access.  It  is  this  property,  when  extended  to 
include  fetch-and-add  operations,  that  permits  the  bottleneck-free  implementation  of 
many  coordination  protocols. 

Since  fetch-and-add  is  our  sole  synchronization  primitive  and  is  also  a  key 
ingredient  in  many  algorithms,  concurrent  fetch-and-add  operations  will  often  be 
directed  at  the  same  location.  Thus,  as  indicated  above,  it  is  crucial  that  a  design 
supporting  large  numbers  of  processors  not  serialize  this  activity  (see  Pfister  and 
Norton  [17]).  Enhanced  switches  permit  the  network  to  combine  fetch-and-adds 
with  the  same  efficiency  as  it  combines  loads  and  stores.  When  two  fetch-and-adds 
referencing the same shared variable, say F&A(X,e) and F&A(X,f), meet at a
switch, the switch forms the sum e+f, transmits the combined request
F&A(X,e+f), and stores the value e in its local memory. When the value Y is
returned to the switch in response to F&A(X,e+f), the switch transmits Y to satisfy
the original request F&A(X,e) and transmits Y+e to satisfy the original request
F&A(X,f). Assuming that the combined request was not further combined with yet
another request, we would have Y=X; thus the values returned by the switch are X
and X+e, thereby effecting the serialization order "F&A(X,e) followed
immediately by F&A(X,f)". The memory location X is also properly incremented,
becoming  X+e+f.  If  other  fetch-and-add  operations  updating  X  are  encountered, 
the  combined  requests  are  themselves  combined,  and  the  associativity  of  addition 
guarantees  that  the  procedure  gives  a  result  consistent  with  the  serialization 
principle. 
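
The combining and de-combining steps can be traced with a small sketch. It models only the arithmetic described above (form e+f, save e, return Y and Y+e); queues, timing, and routing are ignored, and the function names and the three-switch tree in four_requests are illustrative assumptions.

```python
class MemoryModule:
    def __init__(self, x):
        self.x = x

    def fetch_and_add(self, e):
        old = self.x
        self.x = old + e
        return old

def combine(e, f):
    """Forward path: return the combined increment and the value the
    switch saves locally (the increment of the first request)."""
    return e + f, e

def decombine(y, saved_e):
    """Return path: replies for F&A(X,e) and F&A(X,f), given reply y."""
    return y, y + saved_e

def two_requests(x0, e, f):
    mm = MemoryModule(x0)
    combined, saved = combine(e, f)
    y = mm.fetch_and_add(combined)
    reply_e, reply_f = decombine(y, saved)
    return reply_e, reply_f, mm.x

def four_requests(x0, e1, e2, e3, e4):
    """Two switches combine pairs; a third combines the combined requests."""
    mm = MemoryModule(x0)
    left, saved_left = combine(e1, e2)
    right, saved_right = combine(e3, e4)
    root, saved_root = combine(left, right)
    y = mm.fetch_and_add(root)
    y_left, y_right = decombine(y, saved_root)
    r1, r2 = decombine(y_left, saved_left)
    r3, r4 = decombine(y_right, saved_right)
    return [r1, r2, r3, r4], mm.x
```

In four_requests the memory module sees a single F&A, yet the four replies are the values a serial execution in the order e1, e2, e3, e4 would have produced.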

3.2.  Message  Transmission 

To  ensure  adequate  network  performance  in  large  systems,  a  major  goal  in  the 
design  of  the  network  is  to  attain  a  bandwidth  proportional  to  the  number  of  PEs. 
This  has  been  achieved  by  use  of  the  following  techniques: 

• Queues are associated with each switch to allow concurrent processing of
requests for the same port whenever possible. The alternative adopted by
Burroughs [12] of killing one of the two conflicting requests limits bandwidth
to O(N/log N); see [14].

•  Paths  through  the  network  are  not  maintained  while  awaiting  memory 
responses.  Thus,  the  interval  between  messages  is  the  switch  cycle  time,  rather 
than  the  network  transit  time. 

•  Flow  control  information  is  computed  and  transmitted  in  parallel  with 
messages. 

A  major  constraint  on  network  performance  is  the  delay  inherent  in  off-chip 
communication  between  VLSI  switching  nodes,  rather  than  the  rate  at  which 
information  can  be  processed  within  each  node.  Therefore,  significant  amounts  of 
logic  can  be  added  to  each  node  with  advantage  when  that  logic  would  help  avoid 
global  signaling  and  reduce  bottlenecks  within  the  network. 

The  number  of  chips  required  to  implement  each  switching  node  is  determined 
mostly by the high pin count required at each node, rather than the silicon area of
the switching logic. Therefore, messages must be split into multiple packets and
one of two methods can be used to transmit these packets through the network.
The first is a bit-sliced implementation in which different components handle
different packets of one message (transmission of messages is "space-multiplexed").
Alternatively, the transmission of successive packets of a message can be
time-multiplexed to the same component.

Space-multiplexing  provides  a  higher  bandwidth  than  time-multiplexing  at  the 
expense  of  more  components.  However,  a  large  amount  of  "horizontal" 
communication  and  coordination  must  then  take  place  between  the  different 
components  of  a  switch,  as  routing  and  combining  decisions  have  a  global  effect. 
This  further  increases  both  the  complexity  of  such  implementation  and  the  switch 
cycle time. For MOS technologies, the off-chip delays impose an especially high
overhead. 

Several  cycles  are  required  to  transmit  each  message  if  time-multiplexing  is 
used.  However,  the  internal  logic  of  the  switch  can  be  pipelined  so  that  messages 
can  be  handled  on  a  per-packet  basis  and  do  not  have  to  be  assembled  at  each 
switch.  Thus  there  can  be  as  little  as  one  cycle  delay  per  switch  for  each  request 
when  queues  are  empty  and  hence  time-multiplexing  contributes  an  additive  term  to 
the  delay  rather  than  a  multiplicative  factor.  However,  queuing  delays  increase 
multiplicatively  with  the  multiplexing  factor,  so  that  the  performance  of  the 
network under heavy load may be seriously impaired [14]. In the current design we
have chosen to use time-multiplexing, so that each message is divided into one
packet containing the path descriptor, address and opcode, plus one or more data
packets.* In addition, we assume that both the MM and PE numbers are transmitted.
With additional internal switch complexity, the two D bit numbers can be
transmitted as a single D bit amalgam [8].

* At the expense of a severe increase in complexity, the address can also be transmitted in more
than one packet [20].

The  protocol  used  to  transmit  messages  between  switches  is  a  message-level 
rather  than  packet-level  protocol.  That  is,  packet  transmission  cannot  be  halted  in 
the  middle  of  a  message.  A  switch  will  accept  a  new  message  only  if  the  available 
space  in  its  queues  guarantees  that  it  will  be  able  to  receive  the  entire  message. 

4.   Switch  Structure 

Each  network  switch  is  a  2x2  bidirectional  routing  device.  The  goals  in  the 
design  of  the  switch  are  the  following: 

•  Distinct  data  paths  do  not  interfere  with  each  other.  Therefore,  a  new  message 
can  be  accepted  at  each  input  port  provided  queues  are  not  full.  In  addition,  a 
message  destined  to  leave  at  some  output  port  will  not  be  prevented  from  doing 
so  by  a  message  routed  to  a  different  output  port. 

•  A  packet  entering  a  switch  with  empty  queues  when  no  other  message  is 
destined  for  the  same  output  port  leaves  the  switch  at  the  next  cycle. 

•  The  capability  to  combine  and  de-combine  memory  requests  should  not  unduly 
slow  the  processing  of  requests  that  are  not  to  be  combined. 

This  section  begins  with  an  overview  of  the  entire  switch  and  then  discusses  the 
protocols  used  for  flow  control.  Finally,  we  describe  the  logic  used  to  transmit 
responses from the MMs to the PEs. This last description gives considerable detail
since  the  material  has  not  been  published  elsewhere. 

4.1.   Overview 

Figure 2 shows a block diagram of a switching node. The "PE port" connects
to  either  a  PE  or  to  an  MM  port  of  a  preceding  network  stage  and  the  "MM  port" 
connects  to  either  an  MM  or  a  PE  port  of  a  subsequent  network  stage. 

Associated  with  each  MM  port  is  a  combining  queue  capable  of  accepting  a 
packet  simultaneously  from  each  PE  port.  Requests  that  have  been  combined  with 
other  requests  are  sent  to  a  wait  buffer  at  the  same  time  as  the  combined  request  is 
sent  to  the  MM  port. 

From  each  MM  port  a  reply  enters  both  the  wait  buffer  associated  with  the 
MM  port  and  the  non-combining  queue  associated  with  the  PE  port  to  which  the 
reply  is  destined.  An  associative  look-up  is  performed  in  the  wait  buffer  to 
determine  if  the  reply  was  to  a  request  that  had  been  previously  combined  and,  if 
so,  the  de-combined  reply  is  sent  to  the  non-combining  queue  at  the  appropriate  PE 
port.  Each  non-combining  queue  has  four  inputs  since  messages  may  come  from 
both  MM  ports  and  from  both  wait  buffers. 


Figure 2 - Block Diagram of Switch

For  packaging  reasons,  each  switch  is  divided  into  a  forward  path  component 
(FPC),  consisting  of  the  two  combining  queues,  and  a  return  path  component 
(RPC), consisting of the wait buffers and non-combining queues. Data forwarded to
a wait buffer from a combining queue are transmitted from the FPC to the RPC via
ports  called  wait  buffer  output  ports  (WBOPs)  and  wait  buffer  input  ports  (WBIPs) 
on  the  FPC  and  RPC,  respectively. 

The  combining  queues  used  in  the  FPC  are  an  enhancement  of  the  VLSI 
systolic  queue  of  Guibas  and  Liang  [11]  and  are  described  in  [2].  Further  details  on 
the design of combining queues can be found in [20]. The design of the RPC will
be  presented  later  in  this  section.  For  a  detailed  description  of  the  implementation 
of  a  network  for  a  planned  32-PE  prototype,  see  [3]. 

4.2.   Flow  Control 

The  construction  of  the  queues  requires  that  there  be  an  even  number  of 
packets  per  message  and  that  switches  distinguish  even  and  odd  cycles.  At 
initialization,  the  parity  of  the  cycle  is  the  same  as  the  parity  of  the  stage  to  which 
the  switch  belongs,  so  that  cycles  that  are  even  for  a  switch  are  odd  for  its 
predecessors  and  successors  while  the  cycle  parity  of  the  FPC  and  RPC  in  the  same 
switch  are  identical.  Reception  of  messages  starts  only  at  even  cycles  while 
transmission  of  messages  starts  only  at  odd  cycles. 

Each  port  consists  of  data  bits  and  two  protocol  bits:  a  data  valid  bit  (DV) 
traveling  in  the  same  direction  as  the  data  and  a  data  accept  bit  (DA)  traveling  in 
the  reverse  direction.  In  addition,  input  ports  receive  a  routing  (RO)  bit  whose 
value  at  the  first  cycle  of  a  message  transmission  indicates  to  which  output  port  the 
message  is  destined.   The  RO  bit  accompanying  a  packet  entering  a  PE  port  at  the 
i-th stage of the network is the i-th most significant bit of the MM number; at an
MM  port  and  WBIP  it  is  the  i-th  most  significant  bit  of  the  PE  number. 
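
The bit selection used for routing can be written down directly. The sketch below assumes D-bit MM and PE numbers and a network of D stages numbered from 1, which is the usual arrangement for 2x2 switches connecting N = 2^D ports; the function name is ours.

```python
def routing_bits(dest, num_stages):
    """RO bit presented at each stage for a given destination number:
    stage i (1-based) uses the i-th most significant bit."""
    return [(dest >> (num_stages - i)) & 1 for i in range(1, num_stages + 1)]
```

On the forward path dest is the MM number; on the return path the same function applies with the PE number as dest.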

The  two  protocol  bits,  in  conjunction  with  the  RO  bit,  regulate  the  transmission 
of  messages  through  the  network.  A  sender  asserts  DV  when  it  wishes  to  initiate  a 
message  transmission.  Independently,  a  receiver  asserts  DA  when  it  is  able  to 
accept  a  new  message.  A  message  transfer  starts  only  if  both  DV  and  DA  are 
asserted  and  the  cycle  parity  is  correct.  Since  these  control  signals  are  ignored 
during  cycles  when  a  message  transfer  cannot  be  started,  they  can  be  set  ahead  of 
time  to  overlap  data  transfer  and  flow  control  operations.  Note  that  this  is  not 
strictly  speaking  a  handshaking  protocol:  DA  is  not  an  answer  to  DV,  nor  an 
acknowledgment,  but  is  issued  independently  and  simultaneously.  The  sender  is 
transmitting  the  data  on  the  data  lines  whenever  DV  is  asserted.  If  it  receives  DA, 
it assumes the data has been accepted and proceeds with the next packet. No
provision  for  retry  is  necessary. 
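
The start condition can be sketched as a cycle-by-cycle scan. The code assumes a single cycle counter shared by sender and receiver, a start_parity parameter naming the cycles on which a transfer may begin on this link, and messages of two packets by default; all three are simplifying assumptions of ours rather than details taken from the switch design.

```python
def transfer_start_cycles(dv, da, start_parity, packets_per_message=2):
    """Return the cycles at which a message transfer starts.

    dv[c] and da[c] are the DV and DA values during cycle c. The signals
    are ignored on cycles where a transfer cannot start (wrong parity,
    or a message is still in flight)."""
    starts = []
    busy_until = 0
    for cycle, (valid, accept) in enumerate(zip(dv, da)):
        if cycle < busy_until or cycle % 2 != start_parity:
            continue                      # control bits ignored this cycle
        if valid and accept:
            starts.append(cycle)
            busy_until = cycle + packets_per_message
    return starts
```

Because DV and DA are sampled only on eligible cycles, either side may set its bit early, which is what allows flow control to overlap with data transfer.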

4.3.    Return  Path  Component 

The  RPC  routes  responses  from  MMs  to  the  requesting  PEs.  When  a  response 
to  a  request  previously  combined  by  the  FPC  is  detected,  the  RPC  will  generate  an 
additional  response  for  the  other  requesting  PE.  Each  RPC  has  two  MM  (input) 
ports  (IPs)  and  two  PE  (output)  ports  (OPs).  In  addition,  two  wait  buffer  input 
ports  receive  information  from  the  FPC. 

An RPC contains two wait buffers, WB_0 and WB_1, one associated with each
MM port, and two four-input non-combining queues, which are implemented as
eight single-input non-combining queues, Q_{i,j,k}, 0 ≤ i,j,k ≤ 1. (To enable each input
to accept a packet every cycle and to prevent a blockage of one output from
interfering with the other output, one queue is required for each input/output pair
where "inputs" include both wait buffer and input ports.) Queue Q_{0,i,j} is fed from
IP_i and writes on OP_j; queue Q_{1,i,j} is fed from WB_i and writes on OP_j.

A message received on IP_i starting at cycle 2t with routing bit set to j is sent to
Q_{0,i,j} and also to WB_i where its address packet is compared with the messages
currently in the wait buffer. If a match is found, the wait buffer asserts its match
line during cycle 2t+1 but defers sending its generated response to Q_{1,i,j} until cycle
2t+2 so that queues Q_{0,i,j} and Q_{1,i,j} receive the first packets of their messages at
cycles of the same parity.

The DA signal can only be asserted at IP_i if there is at least one empty slot in
each of the two queues Q_{0,i,0} and Q_{0,i,1}. In addition, there must be sufficient room in
queues Q_{1,i,0} and Q_{1,i,1} for messages from WB_i corresponding to both the last message
received at IP_i and the current message. Therefore, DA also cannot be asserted
unless WB_i does not assert match and there is one empty slot in each queue Q_{1,i,k}, or
WB_i asserts match and there are two empty message slots in each* of these queues.


* Since the destination of the message in WB_i is known on-chip at this cycle, this test can be
refined to require only two empty slots in the queue that is the destination of the message.

To arbitrate between the four queues Q_{i,j,k} that can send data to OP_k, the RPC
keeps  track  of  when  each  queue  has  last  sent  a  message.  Of  the  queues  that  are 
non-empty,  the  one  having  sent  least  recently  is  selected. 
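
One way to express this arbitration rule in software is sketched below. The text does not say how ties between queues that have never sent are broken; the sketch picks the lowest-numbered queue, which is our assumption, as are the class and method names.

```python
class LeastRecentlySentArbiter:
    """Selects, among the non-empty queues feeding one output port,
    the queue that sent a message least recently."""
    def __init__(self, num_queues=4):
        self.last_sent = [-1] * num_queues   # -1 means "never sent"
        self.clock = 0

    def select(self, non_empty):
        candidates = [q for q, ok in enumerate(non_empty) if ok]
        if not candidates:
            return None
        # min() returns the first minimal element, so ties go to the
        # lowest-numbered queue (assumed tie-break)
        chosen = min(candidates, key=lambda q: self.last_sent[q])
        self.last_sent[chosen] = self.clock
        self.clock += 1
        return chosen
```

When all four queues stay non-empty the arbiter degenerates to round-robin, so no queue is starved.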

The  DA  signal  is  asserted  at  a  wait  buffer  input  port  if  the  wait  buffer  will  have 
an  available  slot  to  receive  a  message.  As  will  be  seen  below,  a  slot  in  the  wait 
buffer  is  capable  of  simultaneously  receiving  and  transmitting  a  message. 
Therefore, DA will be asserted on WBIP_i if WB_i either does not assert the full
signal  or  asserts  the  match  signal. 
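
The two DA rules of this subsection, for an MM input port and for a wait buffer input port, reduce to simple Boolean conditions. The sketch below states them as functions over free-slot counts; the unrefined form of the input-port test is used (the refinement mentioned in the footnote is not modeled), and the function and parameter names are ours.

```python
def da_at_input_port(free_ip_queues, free_wb_queues, wb_match):
    """DA for an MM input port IP_i.

    free_ip_queues: free message slots in the two queues fed from IP_i
    free_wb_queues: free message slots in the two queues fed from WB_i
    wb_match:       True if WB_i currently asserts match"""
    room_for_input = all(free >= 1 for free in free_ip_queues)
    needed = 2 if wb_match else 1
    room_for_wait_buffer = all(free >= needed for free in free_wb_queues)
    return room_for_input and room_for_wait_buffer

def da_at_wbip(wb_full, wb_match):
    """DA for a wait buffer input port: a matching slot is freed in the
    same cycle, so a full buffer can still accept when match is asserted."""
    return (not wb_full) or wb_match
```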

4.3.1.    Wait  Buffer 

The  wait  buffer  is  an  associative  memory  that  stores  information  sent  by  the 
FPC  when  combining  two  F&A's  into  a  single  request.  The  wait  buffer  inspects  all 
responses  from  MMs  and  searches  for  a  response  to  a  request  previously  combined 
by  the  FPC.  When  it  finds  a  response  to  such  a  request,  it  generates  a  second 
response  and  deletes  the  request  from  its  memory. 
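
A behavioral sketch of the wait buffer follows. The hardware compares the search key against all slots in parallel; the loop below is a sequential stand-in. What exactly constitutes the key is not modeled (any hashable value is accepted), and the slot layout and names are illustrative assumptions.

```python
class WaitBuffer:
    """Associative store of (key, saved increment, routing bit) entries."""
    def __init__(self, num_slots):
        self.slots = [None] * num_slots

    def is_full(self):
        return all(slot is not None for slot in self.slots)

    def store(self, key, saved_e, ro_bit):
        """Record a combined request sent by the FPC; False if no room."""
        for k, slot in enumerate(self.slots):
            if slot is None:
                self.slots[k] = (key, saved_e, ro_bit)
                return True
        return False

    def match(self, key, y):
        """Inspect a memory response carrying value y. On a match, delete
        the entry and return (routing bit, y + saved increment) for the
        second response; otherwise return None."""
        for k, slot in enumerate(self.slots):
            if slot is not None and slot[0] == key:
                self.slots[k] = None
                _, saved_e, ro_bit = slot
                return ro_bit, y + saved_e
        return None
```

A response that matches no entry passes through untouched, so requests that were never combined pay nothing beyond the comparison.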

The structure of a wait buffer* (WB) is shown in Figure 3. A typical message
slot  is  shown  in  the  solid  black  box  and  consists  of  two  registers  (called  Areg  and 
Breg),  compare  logic,  and  a  controller.  Each  register  contains  the  data  bits,  a  data 
valid  (DV)  bit,  and,  for  the  first  packet  of  each  message,  a  routing  (RO)  bit.  The 
registers,  are  connected  in  a  loop  of  length  two,  and  shift  at  each  cycle.  The  Areg 
receives  the  address  packet  of  a  message  at  even  cycles  and  the  data  packet  at  odd 
cycles.  The  opposite  is  true  for  Breg.  Packets  are  stored  in  the  format  they  are 
received  from  the  WBIP  with  the  RO  bit  appended  to  the  address  packet  of  each 
message. 
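
The circulation of a message through the two registers can be traced with a short
simulation. This is a sketch only: the function name circulate is hypothetical, the
load is assumed to begin at cycle 0, and the DV and RO bits are omitted.

```python
def circulate(address, data, n_cycles):
    """Return [(cycle, Areg, Breg), ...] for one slot that starts loading at cycle 0."""
    trace = []
    areg, breg = None, None
    for cycle in range(n_cycles):
        if cycle == 0:
            areg = address               # address packet arrives from the write bus
        elif cycle == 1:
            areg, breg = data, areg      # data packet arrives; address shifts to Breg
        else:
            areg, breg = breg, areg      # loop of length two: the contents swap
        trace.append((cycle, areg, breg))
    return trace


trace = circulate("addr", "data", 4)
```

In the resulting trace the Areg holds the address packet on every even cycle, which
is when the slot needs it for comparison with an incoming search key.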

Each  slot  connects  to  the  following  buses: 

•  The  write  bus  (Wbus)  is  used  to  send  data  to  the  wait  buffer  from  the  FPC  and 
connects  to  a  wait  buffer  input  port. 

•  The read bus (Rbus) is used by each slot for transmission of its message out of
the wait buffer.

•  The  key  bus  (Kbus)  contains  the  search  key  received  from  an  MM  port  of  the 
RFC. 

The  next-slot  (NS)  line  is  a  one  bit  signal  that  is  passed  through  all  the  slots  in  a 
daisy-chain  fashion.  It  is  used  to  select  which  slot  will  receive  the  next  message 
from  the  FPC.  Each  slot  computes 

    NS_out := NS_in and not empty

and the output of the last slot in the chain, after the signal has passed through
all the slots, is the full line of the wait buffer.
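
The daisy chain can be checked with a few lines of code. The sketch below assumes
that the NS input of the first slot is tied on, which the text does not state
explicitly; the function names are hypothetical.

```python
def next_slot_chain(empty):
    """empty[i] is True if slot i is empty. Returns (NS_in seen by each slot, full)."""
    ns = True                      # assumption: head of the chain is tied on
    ns_in = []
    for slot_empty in empty:
        ns_in.append(ns)
        ns = ns and not slot_empty     # NS_out := NS_in and not empty
    return ns_in, ns                   # the final NS_out is the full line


def receiving_slot(empty):
    """Index of the slot that will take the next message, or None if the buffer is full."""
    ns_in, full = next_slot_chain(empty)
    for i, (n, e) in enumerate(zip(ns_in, empty)):
        if n and e:
            return i
    return None


ns_in, full = next_slot_chain([False, True, True])
```

With slots 1 and 2 empty, only slot 1 sees NS_in on while empty, so exactly one
slot is selected, and the full line stays off.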

An adder is used to generate the second response to an F&A operation by
summing the data packet received from a slot and the data packet received from an
MM. It passes address packets unchanged. Since the wait buffer is not required to
respond until two cycles after it receives a matching memory response, the adder
can take a full cycle.

⁴This structure requires each message to consist of a single address packet followed by a single
data packet. Similar structures support messages containing a fixed even number of packets.

Figure 3 - Wait Buffer

The  Dreg  is  connected  to  the  MM  port.  It  is  loaded  on  odd  cycles  with  the  data 
packet  of  a  message  and  presents  that  packet  to  the  adder  on  the  next  cycle,  which 
forwards the sum to the appropriate RFC queue on the cycle after that.

Each slot is full or empty depending on the DV bit in its Areg. For each pair of
cycles 2t and 2t+1, each wait buffer slot simultaneously performs the following
operations: 

•  If DV is present on the Wbus, NS_in is on⁵, and the slot is empty, the Areg is
loaded with the address packet of the message at cycle 2t. During cycle 2t+1
the Breg receives the address packet and the Areg receives the data packet.

•  If the slot is full, the data in the Areg and the data on the Kbus are compared
during cycle 2t. If the DV bit is present on the Kbus, the operation on the
Kbus is an F&A, and the address and PE number are the same in the Areg as
on the Kbus, the slot asserts match. (Note that only one slot can detect a match
because the combination of address and PE number uniquely identifies each
combinable request in the network [3].) If a match is detected, the slot will
present its message on the Rbus at cycles 2t+1 and 2t+2. It will also be
marked as empty during cycle 2t+1 so that it can begin accepting a subsequent
message at cycle 2t+2.


⁵As a performance optimization, a slot can load its registers whenever it is empty. Its DV bit is
then set from the logical AND of the DV bit on the Wbus and the NS_in signal.

If any slot asserts match during cycle 2t, the wait buffer will assert match at cycle
2t+1. The packets presented by a slot to the Rbus on cycles 2t+1 and 2t+2 will be
presented to Q_ijk on cycles 2t+2 and 2t+3 after being processed by the adder.
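
The cycle numbers given in this section can be collected into one timeline. The
sketch below simply tabulates them for a match detected at cycle 2t. The function
name and dictionary keys are hypothetical, and the entry for the Dreg assumes that
the odd cycle on which it is loaded is 2t+1, the cycle following the key
comparison.

```python
def match_timeline(t):
    """Cycle numbers for the events that follow a match detected at cycle 2t."""
    c = 2 * t
    return {
        "slot_compares_and_asserts_match": c,
        "wait_buffer_asserts_match": c + 1,
        "slot_marked_empty": c + 1,
        "dreg_loaded_with_mm_data": c + 1,     # assumption: the odd cycle is 2t+1
        "rbus_packets": (c + 1, c + 2),        # address packet, then data packet
        "adder_receives_operands": c + 2,
        "queue_packets": (c + 2, c + 3),       # address unchanged, then the sum
        "slot_can_accept_new_message": c + 2,
    }


timeline = match_timeline(3)
```

Each packet reaches the queue one cycle after it appears on the read bus, which is
consistent with the adder being allowed a full cycle.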

5.   VLSI  Design  Status 

In  preparation  for  the  design  of  a  complete  combining  switch  chip,  we  have 
designed  several  chips  which  have  been  fabricated  by  DARPA's  MOSIS  facility. 

We have received functional 11-bit wide 2x2 non-combining switch chips
containing approximately 7500 transistors and fabricated in 3-micron NMOS. These
parts operate at a clock speed of 23 MHz with propagation delays from clock to
output of approximately 25 ns. Power dissipation is approximately 1.5 W. A 4x4
test  network  was  constructed  using  four  of  these  parts  and  functioned  as  expected. 

We  have  also  had  a  6-bit  wide  portion  of  the  FPC  (without  the  adder)  for  a  2x2 
combining  switch  fabricated  in  4-micron  NMOS.  This  switch  is  composed  of  four 
1-input  combining  queues.  These  parts  also  operate  as  expected  and  have 
performance and power dissipation similar to the non-combining switches.

Since the final combining switches must be at least 32 bits wide and air-cooled,
we  have  converted  our  design  effort  to  the  newly  available  scalable  double-metal 
CMOS  process,  which  promises  minimum  feature  sizes  as  small  as  1.6  microns.  We 
have  submitted,  and  are  awaiting  the  fabrication  of,  a  35-bit  non-combining  switch 
using  this  CMOS  technology. 

We  have  also  designed  a  fast  32-bit  adder  and  plan  to  have  a  completed  FPC  by 
the  end  of  the  current  academic  year. 